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Abstract 

This paper proposes a variant of the normalized cut algorithm for spectral clustering. Al¬ 
though the normalized cut algorithm applies the K-means algorithm to the eigenvectors of a 
normalized graph Laplacian for finding clusters, our algorithm instead uses a minimum volume 
enclosing ellipsoid for them. We show that the algorithm shares similarity with the ellipsoidal 
rounding algorithm for separable nonnegative matrix factorization. Our theoretical insight im¬ 
plies that the algorithm can serve as a bridge between spectral clustering and separable NMF. 
The K-means algorithm has the issues in that the choice of initial points affects the construction 
of clusters and certain choices result in poor clustering performance. The normalized cut algo¬ 
rithm inherits these issues since K-means is incorporated in it, whereas the algorithm proposed 
here does not. An empirical study is presented to examine the performance of the algorithm. 


1 Introduction 

Clustering is a task of dividing a data set into groups on the basis of similarities between pairs 
of data points. The task is to find groups of data points such that similar data points are in the 
same group and dissimilar ones in different groups. Here, the groups found by an algorithm for 
the clustering task are referred to as clusters. Spectral clustering is a graph-based clustering, and 
the eigenvalues and eigenvectors of the graph Laplacian play a central role. 

In spectral clustering, we construct a weighted graph to represent the similarities of data points. 
The vertices correspond to data points, and the edges are associated with weights. The weights 
are determined by a similarity function that quantifies the similarity of two data points; it takes 
on a large positive value if the data points are similar, while it gets close to a zero value if they 
are dissimilar. Small weights are usually ignored when constructing the graph since they may not 
make a major contribution to the configuration of clusters. For each data point, we pick up some 
data points with high similarity with it, and put edges with positive weights between them. An 
input parameter p, called the neighbor number, determines how many data points are chosen. 

In the weighted graph, the clustering task is to divide the vertex set into groups such that the 
total edge weight in the same groups is large, while those among different groups are small. Shi 
and Malik in [22 1 introduced a normalized cut function to formulate this task. The function assigns 
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a nonnegative real number to the groups of vertices, and it reaches its minimum when the task is 
completed. The ideal goal is to find groups of vertices that minimize a normalized cut function. 

However, finding the optimal solution of the normalized cut minimization problem is hard. We 
instead solve a relaxation problem formed by dropping hard constraints. This is an eigenvalue 
problem for a normalized graph Laplacian. The eigenvalues and eigenvectors of the Laplacian 
contain clues for finding the optimal solution of the normalized cut minimization problem. Thus, 
we attempt to find clusters by using them. The normalized cut algorithm proposed by Shi and 
Malik in m and Ng et al. in m applies a K-means algorithm to the eigenvectors. We will use 
the abbreviation NC to denote this algorithm for short. 

Instead of K-means in NC, we propose to use the minimum volume enclosing ellipsoid for 
the eigenvectors of normalized graph Laplacian. The computation of such an ellipsoid can be 
formulated as a convex optimization problem, and there are efficient algorithms for solving it. Our 
algorithm computes the enclosing ellipsoid and chooses some points lying on the boundary by using 
the successive projection algorithm (SPA) of pTTJ - The points can be thought of as representatives 
of the clusters. Hence, our algorithm assigns data points to the representative points on the base of 
their contribution. The assignment can be formulated as a convex optimization problem. Figure [2] 
of Section [L2l illustrates the algorithm. 

We see in Theorem [T| that the algorithm has a similarity to the ellipsoidal rounding (ER) 
algorithm for separable nonnegative matrix factorization (NMF) in [20] when the neighbor number 
p is set to be equal to the number of data points, in other words, when all weights are taken 
into account in constructing the graph. Strictly speaking, the final outputs returned by these 
two algorithms do not coincide. However, we see in Corollary [T] that the outputs coincide if we 
modify one step of ER. Accordingly, our algorithm can be thought of as an extension of ER. A 
separable NMF is a special case of NMF, and it has applications in clustering and topic extraction 
of documents B El ED] and in endmember detection of hyperspectral images mm- It is a 
matrix factorization problem, and basically differs in purpose from spectral clustering. However, 
the theoretical insights shown here imply that our algorithm can serve as a bridge between spectral 
clustering and separable NMF through neighbor number p. 

The K-means algorithm has the issues in that the choice of initial points is sensitive to the way 
clusters are constructed; some choices yield good clustering performance, while others do not. It is 
difficult to choose good initial points before running the algorithm and the choice affects the cluster 
construction. Hence, NC inherits the issues. There have been many studies indicating that NMF 
based clustering within the block coordinate descent (BCD) framework has a good performance. 
However, the framework has similar issues as K-means. Our algorithm does not have these issues, 
since it consists of solving an eigenvalue problem and convex optimization problems and performing 
SPA. 

We conducted experiments evaluating the performance of our algorithm on real image data 
sets and compared its results with those of existing algorithms, including NC and NMF. We 
set multiple initial points for the existing algorithms and measured the worst, best, and average 
performance. The experimental results showed that the performance of our algorithm is higher 
than the average performance of the existing algorithms in almost all cases. We observed that 
there are initial points that result in the existing algorithms having poor performance, and as 
a result, their average performance gets worse. We also conducted experiments to see how the 
performance our algorithm varies with the neighbor number p. 

The rest of this paper is organized as follows. After introducing the notation and symbols, we 
review the NC algorithm in Section [2j Our algorithm is presented in Section [3l and its connection 
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with the ER algorithm is shown in Section |4j We mention the issues of initial point choice in NC 
and ER in Section 0, and review related work in order to discuss the relationship between spectral 
clustering and NMF in Section [6] An empirical study is reported in Section [71 Finally, concluding 
remarks are given in Section [H 

1.1 Notation and Symbols 

We use R dxm and to denote the set of d-hy-m real matrices and d-hy-m nonnegative matrices. 

Here, a nonnegative matrix is a real matrix whose elements are all nonnegative. We also use § m 
to denote the set of m-by-m real symmetric matrices. Let A be a matrix of proper size. The 
symbol A T represents its transpose. The symbols tr(A) and rank(A) represent the trace and 
rank. We use e and e* to denote a vector of all ones and an ith unit vector. We use I to denote 
an identity matrix. The symbol diag(ai,..., a m ) represents an m-by-m diagonal matrix such that 
diagonal elements are a\,..., a m . Let A\ and A 2 be a d-by-m± matrix and a d-by- 771,2 matrix. We 
use (Ai, A 2 ) to denote the horizontal concatenation of the two matrices and the matrix size is 
d-by-(mi + m 2 ). For a set S , the symbol |<S| represents the number of elements, and the symbol 
S c is the complementary set. 


2 Review of Normalized Cut Algorithm for Spectral Clustering 

We denote m data points by d-dimensional vectors a\,..., a m , and its set by S. Consider r 
subsets of S. If the subsets are disjoint and its union coincides with S, we call the subsets disjoint 
partitions of S. In this paper, we consider clustering algorithms to return the disjoint partitions of 
S, and call the disjoint partitions returned by them clusters of data points. Spectral clustering is 
a graph-based clustering, and the algorithm is based on the eigenvalue decomposition of a graph 
Laplacian. There are some types of algorithms proposed for spectral clustering. In particular, the 
NC algorithm by Shi and Malik of |22j and Ng et al. of [21] is popular and often used. For the 
details of algorithms and history in spectral clustering, we refer the reader to the survey paper 
[26J. Below, we review the NC algorithm. 

In spectral clustering, we set a function / on a weighted graph G, and formulate a clustering 
task as a problem of minimizing /. A weighted graph is a graph such that each edge is associated 
with a weight. Let V and E denote the sets of the vertices and edges. The weight value is given 
by a function k from V x V to a set of nonnegative real numbers. An edge e € E links two vertices 
Vi and Vj if the value of k at e is positive; otherwise, it does not link. For an edge e t j between two 
vertices Vi and Vj, let k^ denote the value of k at e- L j . Consider a subset S of vertex set V. We 
use the notation cut(5,5 c ) to denote the total weight of all edges between S and its complement 
S c . Namely, 

cut (S,S C ) = ^ ^ k v- 

ieS,jeS c 

In the same manner to [26], hereinafter, we use a shorthand notation i £ S. This notation 
represents the indices i of vertices Vi in S. A degree of vertex is the total weight of all edges 
connected to Vi , and we denote it by di. We use the notation vol(<S) to denote the total degree of 
all vertices in S. Namely, 

vol(<S) = Y di 
ies 
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where di is a degree of vertex Vi in S and it is given as di = kij for the weights kij on edges 

etj. The vol(<S) can be regarded as the size of S. 

The NC algorithm consists of three major steps. The first step constructs a weighted graph G. 
A vertex m € V is in one-to-one correspondence with a data point di € S. Let k be a function that 
quantifies the similarity of two data points. The function assigns a nonnegative real number as the 
similarity of data points Gq and dj] it takes on a large positive value if the data points are similar, 
while it gets close to a zero value if they are dissimilar. In the context of spectral clustering, the 
function is referred to as a similarity function. A polynomial function and a Gaussian function are 
popular and often used as the function. In particular, this paper deals with the former function, 
and its form is 

k(ai, dj) = (ajcij + b) c (1) 

where b is a nonnegative real number and c is a positive integer number. These are parameters 
given in advance. 

Small weight values in G may not make a major contribution to the configuration of clusters, 
and thus, are usually ignored. A p-nearest neighbor set is used for this purpose. For a data point 
di, we choose the top p data points with high similarity to a* as measured by a similarity function 
k, and construct a set by collecting these points. Let M p {di) denote the set. The integer number 
p used for the construction is referred to as a neighbor number. The weight value kij of G is given 
as 

k(di, dj), if di € A f p {dj) or dj € N p (di), 

0, otherwise. 

We denote the m-by-m symmetric matrix consisting of k tJ by K. The matrix K is called a weighted 
adjacency matrix of G, and in this paper, it is referred to as an adjacency matrix for short. 

The next step finds clusters from the graph G built in the first step. A graph Laplacian is a 
matrix so as to possess some properties of G. To see the form of this matrix, we need to introduce 
a degree matrix. A degree matrix of G is a diagonal matrix whose diagonal element is the degree 
of each vertex. Namely, the {i, i)th element is given as di = ^2jL\ kij where k n j is an element of 
the adjacency matrix K of G. We denote the m-by-m diagonal matrix by D. A degree matrix is 
usually assumed to be nonsingular. The singularity means that some data point is isolated and is 
completely dissimilar to all the other data points. Thus, such a data point should be removed. A 
graph Laplacian of G is a matrix given as D — K for a degree matrix D and an adjacency matrix 
K. This is an m-by-m symmetric matrix, and we denote it by L. 

We set a function / to find the clusters from G. Although some choices are possible, we consider 
the function for the disjoint partitions S\,... ,S r of S, 

c _ \ ^ cut(«Sj,<Sf) 

, ... ,O r ) — > i / c s 

T^i vol ( 5 *) 

This function is called a normalized cut. Other types of functions have been proposed in this 
context. For instance, a ratio cut function [12] is as popular as the normalized cut one. The 
cut(<S ? , Sf) takes a small value if the data points in S and S c are dissimilar. The vol(«Sj) takes a 
large value if so is the size of S. Therefore, the minimization of the normalized cut function is to 
find the clusters such that the data points in different clusters are dissimilar and the cluster sizes 
are large. 

The ideal goal is to find r disjoint partitions of S minimizing the normalized cut function. To 
be precise, 
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(Spectral clustering by the normalized cut function) Suppose that we are given 
a data set S and an integer number r. Find the disjoint partitions Si,... ,S r of S to 
minimize the normalized cut function / of ([2]). 


The optimal solution of the minimization problem provides the best clusters in the sense that it 
attains the minimum of /. But, this is a hard combinatorial problem, and thus we consider its 
relaxation problem. By using a graph Laplacian L , we rewrite / as 

f = tr(H T LH). 

Here, H is an m-by-r matrix determined by <Si,... ,<S r , and its element hjj is 


h .. = [ l /\fvo\{Sj), ateSj, 

13 1 0, otherwise. 


(3) 


Therefore, the minimization problem of / is equivalent to the problem, 

P : minimize tr (H LH) subject to H satisfies ([3]). 

The hardness of this problem is in the constraint for H. Hence, we drop the hard constraint. 
Instead, we take into account that H satisfies the relation H T DH = I and add it as the constraint. 
Namely, we consider the problem, 

Q : minimize tr (HLH) subject to HDH = I. 

The problem Q serves as the relaxation problem of P. It can be solved through eigenvalue decom¬ 
position. By introducing a new matrix variable G € R mxr such that G = D 12 H , we transform 
the problem into an equivalent one, 

Qo : minimize tr (G T D~ x ^ 2 LD~ X / 2 G) subject to G G = I. 

In this transformation, we use the assumption that a degree matrix D is nonsingular, and in other 
words, the diagonal elements are all positive. The optimal value and solution of Q 0 is obtained 
from the eigenvalue decomposition of normalized graph Laplacian D 12 LD _1 / 2 . We arrange 
the eigenvalues in ascending order, and let denote the values. Namely, the relation 

Ai < ■ ■ ■ < A m holds. Also, let Vi denotes the eigenvector corresponding to the eigenvalue A* for 
i = 1,... ,m. We easily see that the optimal value and optimal solution of Q 0 are given as A and 
V r such that A = Ai + • • • + A r and V r = (v ±,..., v r ). Therefore, those of Q are respectively A and 
D~ x l 2 V r . 

The final step constructs clusters from the optimal solution D~ x / 2 V r of relaxation problem 
Q. We search for the optimal solution H of original problem P based on clues provided by that 
of Q. Let us see the matrix H of © in detail. We take the transpose of H, and denote the 
column vectors of H T by f\,..., f m € M r . The vector /j can be regarded as an indicator to tell 
us which cluster a data point belongs to. The elements of /, are all zero except one element, and 
the position of nonzero element indicates the cluster index to which a data point belongs. The 
convex hull of fi,... , f m is an (r — 1 (-dimensional simplex in M r . Among f 1 ,..., f m , there are r 
different types of vectors and those different ones correspond to the r vertices. 

Let us consider the situation in which the optimal solution D~ l / 2 V r of relaxation problem Q 
is close to the optimal solution H of original one P. In the same way as fi, we take the transpose 
of D~ x l 2 V r and denote the column vectors of VjT D 1 / 2 by p\, ... ,p m € M r . Under this situation, 
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we can have an expectation that the convex hull of pi ,..., p m is similar to the shape of an (r — 1)- 
dimensional simplex, and pi,.. ■ ,p m are located around each vertex of the simplex. Thus, these 
vectors should form r clusters. Accordingly, the clusters can be found by applying a clustering 
algorithm such as K-means to p±,... ,p m . Algorithm Q] describes each step of the algorithm by 
Shi and Malik in [22]. 


Algorithm 1 NC [22] 

Input: A data set S = {ai,... , a m }, a cluster number r, a neighbor number p. and a similarity 
function k. 

Output: Clusters «Si,... ,S r . 

1: Construct an adjacency matrix K € § m by using a similarity function k and a p-nearest 
neighbor set. 

2: Let Z and D be a graph Laplacian and a degree matrix obtained from K. Compute the 
r eigenvectors v\,... ,v r € M m of the normalized graph Laplacian Z ) -1 / 2 LD -V 2 € s m in 
correspondence with the r smallest eigenvalues. 

3: Form a matrix V r = («i,..., v r ), and let pi,... ,p m € M r be the columns of VjJD ~ 1 / 2 . 

4: Apply the K-means algorithm to pi,... ,p m , and find r clusters S\,... ,S r . 


Remark 1. The algorithm by Ng et al. of [21] coincides with that of Shi and Malik of [22j except 
for O^ 1 / 2 in the step 3 of Algorithm [TJ Instead of the matrix, it uses a diagonal matrix S that 
scales each column of V T to have a unit (.2 norm. 


3 Proposed Algorithm 

In this section, we will describe the proposed algorithm. The algorithm is a variant of the NC 
algorithm. Although NC uses K-means for the construction of clusters in the final step, it instead 
uses a minimum volume enclosing ellipsoid (MVEE). 

3.1 Observation for the Optimal Solution of Q 

Our algorithm is built on the following observation. 

Observation 1. Letpi,... ,p m € M r denote the column vectors ofV^D" ~r/ 2 £ R rxm . The convex 
hull of p\,... ,p m has a similar shape to an (r — l)-dimensional simplex in M r , and pi,...,p m are 
around the r vertices. Therefore, these vectors form r clusters in the neighborhood of the vertices. 

As mentioned in the last of Section O this observation should hold if is close to the 

optimal solution H of P. We see in Proposition [T] that the pi,... ,p m € M r lie on a hyperplane, 
and thus, its convex hull is at least an (r — l)-dimensional polytope. In the following description, 
we use the same notation in Section [2] v±,... ,v r € R m are the eigenvectors of normalized graph 
Laplacian Z? -1 / 2 LD^ 1 ^ 2 € § m , and those correspond to each of the eigenvalues ... ,a r with 
Cl < • • • < a r ; pi,... ,p m € M r are the column vectors of V r T D~ 1 / 2 € M rxm with V r = («i,..., v r ) 
given by the v\,... ,v r . 
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Proposition 1. v\ can be chosen as rD 1 ^ 2 e with some nonzero real number t. Thus, if we choose 
v\ as rD l / 2 e, then, pi, ... ,p m lie on a hyperplane TL = {x E R r : ejx = r}. 

Proof. It suffices to show the first part of this statement that V\ can be chosen as rD 1 / 2 e. Let 
L denote a normalized graph Laplacian TV 1 / 2 LD 1 ^ 2 . Since a graph Laplacian L is positive 
semidefinite, so is L. Also, we have Le = 0. Thus, L has a zero eigenvalue as the smallest one, 
and the corresponding eigenspace contains a vector TD 1/l2 e with some nonzero real number r. □ 

Regarding the first part of the proposition, we can find the same statement in Proposition 3 of 

m- 

Remark 2. Although Proposition |T] requires us to choose the vector rD l / 2 e from the eigenspace 
associated with the smallest eigenvalue, eigenvalue solvers such as eig and eigs commands on 
MATLAB do not always output the vector. But, we, of course, can reconstruct it from the 
eigenvectors provided by the solvers. Let £ denote the eigenspace in R m associated with the 
smallest eigenvalue. Suppose that the dimension of £ is s. Since a normalized graph Laplacian is a 
symmetric matrix, we can take s orthonormal basis vectors for £. Let v±,... ,v s € R m denote the s 
orthonormal basis vectors. Then, there exists a nonzero vector p£l s such that rD l ^ 2 e = V s p and 
||p|I 2 = 1 where V s = {y\,... ,v s ) € R mxs since rD l / 2 e € £. We pick up s — 1 orthonormal basis 
vectors pi ,... ,p s -i € R s for the orthogonal complement to p. Let P = (p,pi, .. . ,p s - 1 ) € R sxs . 
We construct V S P € R mxs . Since P is an orthogonal matrix, the s column vectors of V S P are the 
orthonormal basis vectors spanning £, and the first column vector is rD 1//2 e. 

Remark 3. The matrix D 12 in the step 3 of Algorithm Q] can be thought of as a scaling matrix 
that scales the column vectors of to lie on a hyperplane. As mentioned in Remark [T] instead 
of D _1 / 2 , the algorithm by Ng et al. of [21] uses a diagonal matrix S that scales those of to 
lie on a unit cube. 

3.2 MVEE for the Points in Simplex 

Observation [T] implies that p±,,p m € M r form r clusters in the neighborhood of the vertices of 
an (r — l)-dimensional simplex in M r . Thus, if we can fold r vectors from pi, ■ ■ ■ ,p m that are close 
to the vertices of the simplex, those vectors should serve as the representative points of clusters. 
An MVEE for pi,... ,p m can be used for finding such vectors. This is because it can touch the 
near-vertices if the convex hull of pi,... ,p m is similar to the shape of simplex. Originally, this 
geometric property has been shown in mm in order to design algorithms for a separable NMF 
problem. 

In this section, we first see the formulation of MVEE, and then recall the precise description 
of the property in Propositions 0 and [3] Let L be an r-by-r positive definite matrix, and z be 
an r-dimensional vector. A full-dimensional ellipsoid in R r is defined as a set £ = {x G R r : 
(x — z) T L(x — z) < 1}. The vector z serves as the center of ellipsoid £. In particular, an ellipsoid 
is referred to as an origin-centered ellipsoid if the center z matches the origin (in other words, z 
is a zero vector). It is known that the volume of unit ball in R r only depends on the dimension r, 
and we denote it by c(r). The volume of £ is given as c(r)/\fdetL. Let B denote the boundary of 
£ such that B = {x £ R r : (x — z) T L(x — z) = 1}. A vector x is called as an active point of £ if 
it lies on the boundary such that x 6 B. 

Let pi,... ,p m be m points in R r . We put the following assumption on the points. 
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Assumption 1. rank(P) = r for P = (pi ,... , p m ) €E M rxm 


We consider an origin-centered ellipsoid to enclose the convex hull of a set S = {±pi,..., ±p m }, 
and describe several properties of the enclosing ellipsoid with minimum volume. Assumption [T] 
ensures that the convex hull of S is full-dimensional in M r , and thus, the enclosing ellipsoid has 
a positive volume. The volume minimization of enclosing ellipsoid for the convex hull of S is 
formulated as the problem, 

R(5) : minimize — logdetL, 

subject to (pp T , L) < 1 for all p £ <S, 

L >- 0. 

The r-by-r matrix L is the decision matrix variable. The symbol >- for a matrix represents that 
it is positive definite. In the problem R(5), we need to put Assumption [T] to ensure the existence 
of optimal solution. Let L* denote the optimal solution. The set {x € M r : xL*x < 1} is an 
origin-centered enclosing ellipsoid with minimum volume for the convex hull of S. This paper 
refers to it as an origin-centered MVEE for S. The active points of the origin-centered MVEE are 
the points Pi satisfying pjL*pi = 1. The problem R(5) is a convex optimization problem, and 
there are polynomial-time algorithms for solving it. The problem and algorithms for computing 
an MVEE have been well studied, and we refer the reader to, for instance, [61 for the details. 

We describe the propositions mentioned in the first of this section. 

Proposition 2 (Lemma 11 of [20]). Suppose that pi,... ,p m € M r satisfy Assumption 0 Then, 
the origin-centered MVEE for S = {±pi, • • • , ±p m } touches at least r points with plus-minus signs 
among pi,... ,p m - 

Proposition 3 (Proposition 3 and Corollary 4 of (20]). Consider an (r — 1)-dimensional simplex 
in M r . Let pi,... ,p r £ M r be the vertices, and let P = (pi,... ,p r ) € M rxr . 

(a) Suppose that qi ,..., q n € M r belong to the simplex. Construct a set S = {±pi,... , ±p r , ±gi,..., ±g n }- 
Then, the origin-centered MVEE for S only touches the vertices p\,...,p r with plus-minus 
signs. 

(b) Suppose that qi,...,q n € M r are given as qi = Pk, by using k{ £ M r such that ||fcj ||2 < 1- 
Construct a set S = {p\, ... , ±p r , ±Qi, ..., ±q n }. Then, the origin-centered MVEE for S only 
touches the vertices pi,... ,p r of with plus-minus signs. 

The proof for Proposition [3] a) is also found in ([TO] . Proposition (3](b) can be thought of an 
extension of that of Proposition [3]( a) in some sense since qi,...,q n in Proposition [3)( a) can be 
written as q, = Pki by using fcj such that ||/n||i = 1 and fcj > 0. It should be noted that 
Assumption Q] is satisfied in Proposition [3] since pi,... ,p r are the vertices of (r — l)-dimensional 
simplex. 

We see from Proportions [2] and [3] that the following geometric properties hold. Consider a 
finite number of points, including the r vertices, in an (r — l)-dimensional simplex in M r . Then, 
the origin-centered MVEE for the points touches the r vertices. Furthermore, it also holds even if 
some amount of perturbation is added to the points. If the perturbation is large, the ellipsoid can 
touch more that r points. 
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3.3 Algorithm Description 

Recall Observation [Q pi, ■ ■ ■ , p m € M r are the column vectors of V, T D 1 / 2 € M rxm whose trans¬ 
pose is the optimal solution of problem Q. Suppose that Observation [T| holds. Then, the convex 
hull of pi ,...,Pm is similar to the shape of a simplex. Proposition [3] tells us that the MVEE 
for the set S = {±pi,..., ±p m } can touch some vectors among pi,... ,p m that are close to the 
vertices of the simplex. Such vectors should serve as the representative points of clusters. Ac¬ 
cordingly, clusters can be found by assigning pi,... ,p m to the representative points on the base 
of their contribution. Here, we should pay attention to the following issue. Proposition [2] implies 
that the MVEE for the S has a possibility to touch more than r vectors among pi,,p m . The 
algorithm for a separable NMF problem such as SPA PI can be used for selecting r points from 
the candidates. We will explain the problem and algorithms in Section T4.11 

Algorithm 2 NCER 

Input: A data set S = {ai,..., a m }, a cluster number r, a neighbor number p , and a similarity 
function k. 

Output: Clusters S\,... ,S r . 

1: Run steps 1-3 of Algorithm[TJ and let S = {±pi,..., ±p m } where these vectors pi,..., p m € 
M r are the columns of Vj T D~ 1//2 € M rxm . 

2: Compute an origin-centered MVEE for the set S , and find all active points. Let X be the 
index set of the active points. 

3: If \X\ = r, set J as J = X. Otherwise, select r elements from X by SPA, and construct the 
set J of these r elements. 

4: Assign each Pi,... ,p m to any one of Pj, j € J on the base of contribution, and construct 
r clusters Si,... ,S r . 


Our algorithm is presented in Algorithm [2j We denote it by NCER since it is a variant of NC 
and uses an ellipsoid rounding to find clusters in the final step. Below, we explain the details of 
steps 3 and 4. 

We see from Proposition[2]that the index set of active points X constructed in the step 2 contains 
at least r elements since the vectors pi, ... ,p m € M r satisfy rank(P) = r for P = (pi, ... ,p m ) € 
T xm . Therefore, step 3 constructs a set J by setting J = X if \I\ = r; otherwise, selects r 
elements from X, and constructs a set J. SPA can be used for the selection. If the convex hull of 
vectors p*, i € X is similar to the shape of a simplex, SPA can select r vectors from pi, i € X that 
are close to the vertices of the simplex. We refer the reader to Algorithm 1 of mi for the details 
of SPA. 

Step 4 assigns pi,... ,p m to r representative points pj , j € J , and constructs r clusters. The 
assignment is conducted on the base of contribution rate of representative points in generating a 
point. Let P(J) denote anrxr matrix consisting of the representative points pj, j € J. Namely, 
P{J) = (Pj : j € J). Consider some p*, i € {1,..., m}. We define a contribution rate w of the 
representative points to p,; as the optimal solution of the problem, 

minimize || P(J)w — Pi11 2 subject to w > 0. 

The r-dimensional vector w is the decision variable. This is a convex optimization problem, and 
sometimes referred to as a nonnegative least square (NLS) problem. The optimal solution can 
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be obtained by using an active set algorithm. We refer the reader to m for the details of the 
algorithm. Let w* be the optimal solution of the problem. We find the index j of the largest 
element among the r elements of w*, and assign pi to a cluster Sj. 


4 Connection to Separable NMF 

In this section, we will see that NCER for spectral clustering is closely related to ER for separable 
NMF in [20]. In fact, NCER shares similarity with ER when a neighbor number p is set to be 
equal to the number of data points. 

4.1 Separable NMF Problem 

Consider a d-by-m nonnegative matrix A such that 

A = FW for F € M+ Xr and W = (/, K)U € R r + m . (4) 

Here, I is an r-by-r identity matrix, K is an r-by-(m — r) nonnegative matrix, and II is an m-by-m 
permutation matrix. A nonnegative matrix A is called a separable matrix if it can be decomposed 
into F and W of the form (j4]). We call the F a basis matrix and the W a weight matrix. A 
separable NMF problem is a problem of finding the basis and weight matrices from a separable 
matrix. To be precise, 

(Separable NMF problem) Suppose that we are given a separable matrix A of 
the form (JJ]) and an integer number r. Find an index set X with r elements such that 
F = A(X). 

Here, A(X) denotes the submatrix of A which consists of the column vectors with indices in X. 
Namely, A(X) = (a* : i € X) for the column vectors of A. 

A separable NMF is the special case of NMF. As we will mention in Section [5] solving an 
NMF problem is hard in general. As a remedy for the hardness, Arora et al. in HIE] proposed 
to make the assumption called separability on the NMF problem. Then, it turns into a tractable 
problem under the assumption. An NMF problem under a separability assumption is referred to 
as a separable NMF problem. Separable NMF has applications in clustering and topic extraction 
of documents mmm and endmember detection of hyperspectral images mm- 

We shall look at the separable NMF problem from geometric point of view. Since the data 
matrix A is a separable one of the form Q, the conical hull of the column vectors of A is an 
r-dimensional cone in M d . It has r extreme rays, and those correspond to the column vectors of 
basis matrix F. The intersection of the cone with a hyperplane is an (r— 1 (-dimensional simplex in 
R d , and the r vertices correspond to the column vectors of basis matrix F. Figure Q] illustrates the 
geometric interpretation of separable matrix. Hence, a separable NMF problem can be rewritten as 
a problem of finding all vertices of (r — l)-dimensional simplex in M d . Several types of algorithms 
have been proposed for a separable NMF problem. SPA [EE] and XRAY|l5j as well as ER are 
designed on the geometric interpretation of separable matrix. 

It is ideal that an algorithm for a separable NMF problem is robust to perturbation which 
disturbs the separability structure. For a separable matrix A of (QJ and a d-by-m real matrix AT", 
we consider the matrix 

A = A + N. (5) 
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Figure 1: Geometric interpretation of a separable matrix A with ( d,m,r ) = (3,7,3). Black points 
are the intersection points of the column vectors of A with a hyperplane. The region surrounded 
by the black lines represents the convex hull of intersection points, fi, f 2 , and /3 are the column 
vectors of basis matrix F in A. 


We call the A a near-separable matrix since the separability structure of A is disturbed by AT. A 
separable NMF algorithm is said to be robustness if the algorithm can find a matrix close to the 
basis matrix in a near-separable matrix. It has been shown theoretically and practically in [20] 
and mi that ER and SPA are robust algorithms. 

4.2 ER Algorithm 

We give a precise description of ER. As already mentioned, the algorithm is based on the geometric 
interpretation of separable matrix, and Propositions [2] and [3] are the backbone of the algorithm. 
In the application of Proposition [3] to a separable NMF problem, we need to pay attention to the 
following point. The proposition requires that an (r — l)-dimensional simplex is in an r-dimensional 
real space. But, in the problem, it can be in higher dimensional space. Therefore, a dimension 
reduction is required to be performed. 

Algorithm [3] describes each step of ER. In the description, the notation m is still used to denote 
the number of column vectors of B for simplicity although it varies if the zero column vectors of 
B are removed in the step 2. We explain the details of the steps 1 and 2 although the steps 3 and 
4 are the same as the steps 2 and 3 of NCER. 

Step 1 computes the singular value decomposition (SVD) of A. The SVD of A takes the form 

A = uj:v t . ( 6 ) 

U is a d-by-d orthogonal matrix consisting of left singular vectors U\,... ,114 € R d . V is an 
m-by-m orthogonal matrix consisting of right singular vectors V \,..., v m € M m . S is a d-by- 
m diagonal matrix (diag(<7i,..., aj), 0) consisting of singular values oq,..., < 7 ^ with the relation 
(j\ >■■■ > Od- Here, the 0 is a d-by-(m — d) zero matrix. We pick up the r largest singular values 
a 1 ,..., ay, and construct the r-by-r diagonal matrix E r = diag(<7i,..., ay). The matrix U r 'E r V J T 
is known to be the best rank-r approximation matrix to A in measured by matrix 2-norm. Here, 
U r = (111, • • •, u r ) € M. dxr for the r column vectors 111, ..., u r of U, and V r = (iq, ..., v r ) € M mxr 
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Algorithm 3 ER 

Input: A d-hy-m nonnegative matrix A, and a basis number r. 

Output: An index set J. 

1: Perform a dimension reduction of A € M^ xm from d to r and construct a dimensionally 
reduced matrix B € W rxm through SVD. 

2: Remove all of zero column vectors in B if exist. Construct a diagonal matrix S G R mxm 
such that all column vectors of BS lie on a hyperplane XL. Let S = {±gi,..., ±q m } for the 
column vectors qi,..., q m G M r of BS. 

3: Compute an origin-centered MVEE for the set S , and find all active points. Let X be the 
index set of the active points. 

4: If \X\ = r, set J as J = X. Otherwise, select r elements from X by SPA, and construct the 
set J of these elements. 


for those v \,..., v r of V. It should be noted that we have A = UX>V T = U r T. r V^ if A is a 
separable matrix. After the SVD computation of A, step 1 constructs 

B = E r V r T € M rxm . 

We call the matrix B a dimensionally reduced matrix for A. 

Step 2 constructs a diagonal matrix S that scales the column vectors of B to lie on a hyperplane. 
There exists such a hyperplane since B has no zero column vectors. We consider a hyperplane 
XL = {x G M. d : w T x = z }. Here, w is a d-dimensional real vector such that the elements are all 
nonzero and 2 is a nonzero real number. The intersection points of column vectors 6, of B with XL 
are given as Sjbj for s,; = z/w T bi. Thus, in this step, we construct S = diag(si,..., s m ) by using 
the Sj. 

We need to put an assumption on a basis number r which is set as an input parameter in 
advance. Step 3 computes an MVEE for the column vectors qi,. ■ ■, q m of BS G K rxm . In the 
MVEE computation, Assumption [I] needs to be satisfied and the computation is allowed under the 
assumption. Therefore, we put an assumption on a basis number r that r < rank(A) for a data 
matrix A. If the assumption holds, the r largest singular values of A are positive, and thus, we 
have rank(d?S') = r. 

ER has been shown in [1 20] to have the following properties. Let A be a separable matrix 
of the form (]4|). Then, ER for an input data (A,r) with r = rank(A) returns the index set 
J such that F = A{J). Let A be a near-separable matrix of the form (]5]). If N satisfies 
||IV|| 2 < e(F.K), then, ER for an input data (A, r) with r = rank(A) returns the index set J 
such that || F - A{J)\\ < e(F, K). Here, e(F, K) is a nonnegative real number determined by 
the basis matrix F and weight matrix K. For the details, we refer the reader to Theorem 9 of the 
paper. 

Remark 4. The original description of ER in [20] does not contain the step 3 of the above 
algorithm description. If the input data matrix A is a separable one, we can assume without 
loss of generality that the column vectors of A lie on a hyperplane. This is because we have 
ADi = FD 2 FK 2 1 WD\ for nonsingular matrices D\ and D 2 . Hence, we can skip the step 3 if A 
is a separable matrix and the column vectors of A are scaled to lie on a hyperplane in advance. 
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4.3 Connection between NCER and ER Algorithms 

We observe the active point sets constructed in the step 2 of NCER and the step 3 of ER, and 
then, see that the two sets coincide with each other under some assumptions on data points and 
parameters of the algorithms. In the next steps, the algorithms select r elements from the active 
point sets, and construct sets by collecting them. The two sets are different in general. But, we 
see that those sets coincide if we modify one step of ER. 

We put an assumption on data points oi,..., a m <E R d for NCER and ER. 

Assumption 2. Any of m data points a\, ..., a m £ R d are nonnegative and not a zero vector. 

The first part of Assumption [2] is required for ER. In below discussion, we need to handle a 
diagonal matrix 


D = diag(di,..., d m ) € M mxm with d, = aj(ax H-f- a m ) (7) 

for data points ai,...,o m . The second part of Assumption [2] is used to ensure that the D 
is nonsingular. Let A = (a\,... ,a m ) £ M. dxm for the data points We put 

assumptions for the parameters of NCER and ER. 

Assumption 3. For the NCER algorithm, we have the following settings. 

(a) A similarity function in the input is set as k(a,i,aj) = ajaj. 

(b) A neighbor number in the input is set as p = m. 

Assumption 4. Let D be of {?]j. For the ER algorithm, we have the following settings. 

(a) A data matrix in the input is set as AD _1 / 2 . 

(b) A basis number in the input is set as r satisfying r < rank(A). 

(c) A diagonal matrix in the step 2 is set as S = D~ x ! 2 . 

It should be noted that the D in Assumption 0] is nonsingular under Assumption^ Assumption 
[3ja) chooses a polynomial function of (HJ with 6 = 0 and c = 1 as a similarity function. This 
function gives a similarity of two data points as its inner product. Assumption E^b) means not to 
ignore small values of similarity, and take into account all the values in a graph. Assumption [4j a) 
scales the data points a i,..., a m into d 1 7 a\,... ,d m o, m by the diagonal elements d±,... ,d m 
of D. We give a remark on this assumption in the last of this section. We see below that the D 
corresponds to a degree matrix of a graph constructed by the similarity function k and p-nearest 
neighbor set under Assumption [3j As mentioned in the last part of Section 14.21 Assumption H^b) 
ensures to allow us to carry out an MVEE computation in the step 3 of ER. We see below that 
Assumption |4jc) makes the column vectors of dimensionally reduced matrix lie on a hyperplane. 

Theorem 1. Assume that Assumptions [H [3J and0] hold for m input data a ±,..., a m £ R d and 
also the NCER and ER algorithms. Let T\ be the index set of active points in the step 2 of NCER, 
and Z 2 be that in the step 3 of ER. Then, we have T\ = Z 2 . 
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We prove this theorem in the remaining of this section. First, we shall see in detail how vectors 
Pi, ..., Pm € M r in the step 1 of NCER are constructed under Assumptions [2] and El For the 
data points oi,..., a m € M r , we let A = (ai,..., a m ) € M dxm . Under the assumptions, a graph 
Laplacian L for the data points is 

L = D- A t A. (8) 

Here, D is a degree matrix for an adjacency matrix A T A, and its form is given as ([7]) . The normal¬ 
ized graph Laplacian L is of the form I — LV 1 / 2 A T AD^ 1 / 2 . Hence, the eigenvalue decomposition 
of L can be obtained through the SVD of AD -1 / 2 . In a similar way to the description of ©, we 
write it as 

AD~ 1/2 = C/SV T . (9) 

Here, U and V are respectively d-by-d and m-by-m orthogonal matrices, and the column vectors 
are respectively the left and right singular vectors of ALU 1 / 2 . S is a d-by-m diagonal matrix 
(diag(<Ti,..., ad), 0), and a \,..., ad are the singular values of ALU 1 / 2 . The eigenvalue decompo¬ 
sition of L can be written as 


L = D 1/2 LD 1/2 

= I - LU 1 / 2 A t ALU 1 / 2 

= U(J-E t E)U t . (10) 

by using the expressions ((HJ) and Q. We arrange the eigenvalues of L in ascending order, and 
denote the ith eigenvalue by Namely, Ai < • • • < A m holds. Then, from the expression (fTOl) . 
we have 

= f 1 - a 2 , i = 1,... ,d, 

1 \ 1 i = d + 1,..., 77i 

for 1 = a\ > • • • > ad > 0. Here, the relation a i = 1 comes from the fact that a normalized 
graph Laplacian is positive semidefinite and has a zero eigenvalue. The column vector of V 

is the eigenvector of L, and corresponds to the eigenvalue A* and also singular value a,. Let 

V r = (i>i,..., v r ) € M mxr . Under the assumptions, we see that p\,... ,p m in the step 1 of NCER 
are the column vectors of 

P = V r T LU 1 / 2 , (11) 

and the column vectors v\,...,v r oiV r are the r right singular vectors of AD -1 / 2 in correspon¬ 
dence with the r largest singular values a \,..., a r with 1 = a\ > • • • > oy > 0. 

Next, we shall see in detail how vectors q\,.... q m € M r in the step 2 of ER are constructed 
under Assumptions [2] and [4j Under the assumptions, step 1 computes the SVD of AD 1 / 2 and 
constructs a dimensionally reduced matrix B for the matrix. Since the D is equivalent to the degree 
matrix constructed in NCER under Assumptions [2] and [3l the SVD of AD -1 / 2 coincides with that 
of ©. Let V r = (iq,..., v r ) € R mxr for the r right singular vectors v\,...,v r in correspondence 
with the r largest singular values a\,... ,a r such that 1 = ai > ■ ■ ■ > ay > 0. The last strictly 
inequality a r > 0 comes from Assumption IU(b) . Also, let E r = diag(oi,... ,a r ) € M rXr for the r 
singular values a±,... ,a r . Then, the dimensionally reduced matrix B is given as S r V r T € M rxm 
by using those E r and V r . From Assumption IU(b), Step 2 constructs 

Q = T, r V r T D ~ 1/2 € M rxm (12) 

by using D of ([7]). The V r and D coincide with those of (HID . As already shown in (HOD , the V r 
corresponds the r eigenvectors of normalized graph Laplacian for L of ([S]). We see from Proposition 
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D] that. Q of (fl2l) does not have any zero column vectors and the column vectors q±,, q m lie on 
a hyperplane {x £ : ejx = r} for some nonzero real number r. 

The column vectors </i,..., q m of Q in (fT2l) can be thought of as the images of those pi,..., p m 
of P in (11111 under a linear transformation . This transformation is nonsingular since (T\,... ,a r 
are all positive under Assumption [l|b). It is well known that an MVEE computation is invariant 
under a nonsingular linear transformation; see [6] for instance. Therefore, the following proposition 
holds. 

Lemma 1. Let G £ K rxr be a nonsingular matrix. Consider p\,... ,p m ,q\,... ,q m € M r such 
that qi = Gpi for i = 1,..., m. Let X\ be the set of active points in the origin-centered MVEE for 
a set S = {±pi, ..., ±p m } and Z 2 be that for a set T = {±Qi, ■ ■ • , ±q m }. Then, we have T\ = 1%. 

Proof. The origin-centered MVEEs for each of sets S and T are given by the optimal solutions 
of problems R(«S) and R(T). We denote by L* € M rxr and M* € M rxr the optimal solutions of 
R(<S) and R(T). Then, we have L* = G T M*G. Therefore, the active point set of origin-centered 
MVEE for S coincides with that for T. □ 

The proof of Theorem [T] is now obtained. 

Proof of Theorem\ [7] It follows from the above discussion and Lemma [I] □ 

The step 3 of NCER and also the step 4 of ER select r elements from the active point sets, 
and construct the index sets of the elements. Although the two index sets are different in general, 
those coincide if we modify one step of ER. In the step 1, we set B as V r T instead of S r VjT. We 
denote the modified version of ER by MER. We immediately have the following corollary. 

Corollary 1. Assume that Assumptions H 0 and^hold. Let J\ be the set constructed in the 
step 3 of NCER, and J 2 be that in the step 4 of MER. Then, we have J\ = J 2 - 

Accordingly, if we perform MER, and classify q±,..., q rn after the step 4 by following the step 4 
of NCER, then, the obtained clusters coincide with those by NCER under the three assumptions. 
We will give the demonstration in Section 17.41 

Remark 5. Assumption m a) requires us to perform a data scaling in ER. The same data scaling 
can be found in m ■ Empirical results in the paper show that the scaling can enhance the 
performance of NMF based clustering. 


5 Issues in Clustering by K-means and NMF 

We recall the algorithms in K-means and NMF clustering, and then, mention the issues about the 
choice of initial points. An empirical study will be reported to confirm the issues in Subsection 
1731 Given a data set S and an integer number r. Let a denote a data point in S. In K-means, 
we set the function for the disjoint partitions S\,...,S r of S, 

r 

/(Si,...,S r ) = ll a — 11 2 

j = l a£<Sj 
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where 


(13) 


c j = 


SaeS,- a 

-l^- 


and consider the problem of minimizing the function /. The Cj serves as a center of the partition 
Sj. Although, conventionally, the number of partitions is denoted by K in K-means, we continue 
to use r for consistency in this paper. The minimization problem is hard to solve, and in fact, has 
been shown to be NP-hard in US El- Thus, instead of the global optimal solution, we find the local 
solution by using an alternative procedure. We minimize / by alternatively fixing centers C\,... ,c r 
and partitions <Si,... ,S r . When c±, ... ,c r are fixed, a data point a is assigned to Sj having the 
nearest center Cj such that the index j attains the minimum value of ||a — Cj ||| for j = 1 ,..., r. 
When Si,... ,S r are fixed, Cj is computed by following (|13l) . The K-means algorithm repeats the 
procedure until some stopping criteria are satisfied. To start the alternative procedure, it requires 
us to arbitrarily choose C\,... ,c r and fix them in advance. It is empirically known that the choice 
of initial centers is sensitive to the cluster construction; some choices may make it possible to show 
good clustering performance, while others may not. It is difficult to choose good initial points 
before running the algorithm and the choice affects the cluster construction. NC has the same 
issues since K-means is incorporated in it. 


Given a d-by-m nonnegative matrix A and an integer number r. The column vectors of 
A correspond to data points. In NMF, we find a d-hy-r nonnegative matrix F and an r-by-m 
nonnegative matrix W such that the product of F and W is as close to A as possible. A natural 
way for the formulation is that we set the function for matrices F E M. dxr and W € R rxm 

f(F,W) = \\FW- A\\ 2 f (14) 


and consider the problem of minimizing the function / under the nonnegativity constraints on 
F and W. Solving the problem is hard, and has been shown to be NP-hard in [25]. However, 
the intractable problem can be reduced into tractable subproblems if either of F or W is fixed. 
Each subproblem becomes a easily solvable NLS problem. The NMF algorithm repeats to solve 
NLS problems by alternatively fixing F and W so as to minimize / until some stopping criteria 
are satisfied. This alternative procedure can be regarded as a BCD framework which is used for 
minimizing a nonlinear function; see [14] for the details of the framework. After that, the algorithm 
classifies data points a, to r clusters by using the obtained F and W. The assignment is similar 
to the step 4 of NCER. We regard the column vectors of F as the representative points of clusters, 
and the elements of column vectors of W as a contribution rate. For a column vector Wi , we find 
the index j of the largest element in Wj, and assign a data point a* to the jth cluster Sj. The 
NMF algorithm employs a BCD framework, and it requires us to arbitrarily choose F (or W) and 
fix it in advance. As is the case in K-means, the choice of initial matrix is sensitive to the cluster 
construction. 


There are many studies on the efficient implementation of NMF algorithm. The main discussion 
points are in how to solve NLS problems. Although various types of algorithms are proposed, the 
multiplicative update algorithm of m may be the most popular one. Empirical results in [ 48] and 
m imply that a projected gradient algorithm and an active set algorithm are faster and provide 
more accurate solutions than a multiplicative update. We refer the reader to m for further 
discussion. 

We finally mention the GNMF algorithm proposed in [7]. Empirical results in the paper imply 
that GNMF outperforms NC and NMF in clustering. This is a variant of the NMF algorithm. In 
GNMF, we add the regularization term R to the function / of (I14|) . and set the function 

f(F, W) = \\FW — A\\ 2 f + \R(W) (15) 
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where 


- m m 

R(W) = 5EEM-—iiil 

j=l i=l 

= tr {W J LW). 

The problem we consider is to minimize the function / under the nonnegative constraints on F 
and W. Here, kjj are the elements of adjacency matrix K which is constructed for data points 
ai,..., a m , and L is the corresponding graph Laplacian. Also, A is a parameter and takes a 
nonnegative value. The term R is called a graph regularizer. We interpret W as a dimensionally 
reduced data matrix for A. If data points a* and a,j are similar to each other, the term forces 
the dimensionally reduced ones Wi and Wj to be similar. The GNMF algorithm minimizes / by 
using a BCD framework, and solves subproblems by using an algorithm based on the notion of 
the multiplicative update. After that, it applies K-means to the column vectors W \,... ,w m of 
W, and finds r clusters. Accordingly, there are two parts in GNMF that require us to set initial 
points in advance. 


6 Related Work on Connection of Spectral Clustering and NMF 

Some of previous studies discuss the relationship between spectral clustering and NMF. In partic¬ 
ular, the paper [8j points out that a similarity can be found in the problem formulations in spectral 
clustering and NMF. We again consider the normalized cut minimization problem P. Since the ma¬ 
trix variable H are nonnegative, we construct the following relaxation problem by taking account 
of it. 

minimize tr (H T LH) subject to H DH = I and H > 0. 

Under the change of variable G = D 12 H, this is equivalent to 

minimize tr (G T D~ 1 / 2 LD~ 1 ^ 2 G) subject to G T G = I and G > 0 

since D is a diagonal matrix such that the diagonal elements are all positive. Furthermore, 
the object function is tr(J) — tr (G T WG) and we have tr(J) — 2tr (G T WG) + tr (W^W) = 
||GG t - W\\ 2 p. We here let W = D ^WD ' 1 / 2 . Thus, the above relaxation problem is 
essentially equivalent to 

minimize \\GG T — W\\ 2 F subject to G T G = I and G > 0. 

Hence, if we drop the orthogonal constraint G T G = I from there, then, it can be thought of as 
the special case of NMF since W is a nonnegative matrix. However, to the best of the author’s 
knowledge, there may be no work to rigorously discuss the similarity of algorithms for spectral 
clustering and NMF. 


7 Experiments 

We will show an empirical study to investigate the performance of NCER. The first experiments 
visualize the products obtained from NCER for a small data set. In the experiments, we will see 
the relationship between the performance of NCER and Observation [U The second experiments 
compare the performance of NCER with existing algorithms. The third experiments display how 
the performance of NCER varies with the neighbor number p. All experiments were conducted on 
a 3.2 GHz CPU processor and 12 GB memory. 
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7.1 Algorithm Implementation, Data Sets, and Measurements 

All algorithms used in the experiments were implemented on MATLAB. We describe the imple¬ 
mentation details as follows. 


• NCER. Three computation require to be carried out: eigenvalue and eigenvector computa¬ 
tion, MVEE computation, and NLS problem solving. We used MATLAB commands eigs 
and lsqnonneg for the first and third computation. The lsqnonneg employs the active set 
algorithm for an NLS problem. For the second computation, we performed the interior-point 
algorithm in a cutting plane framework. It has been shown empirically in Earn that the use 
of cutting plane accelerates the efficiency of interior-point algorithm for an MVEE problem, 
and makes it possible to handle large problems. We used the software package SDPT3 [23] 
to perform the interior-point algorithm. 

• NC. In addition to eigenvalue and eigenvector computation, K-means requires to be per¬ 
formed. We used a MATLAB command kmeans for performing it and the eigs command. 

• NMF. NLS problems require to be solved in the BCD framework for minimizing / of (I14H . 
As mentioned in Section [5] there are various possibilities for the choice of algorithms for 
solving NLS problems. We used the code available from the author’s website of [18] . The 
code employs the projected gradient algorithm for the problems, and it tends to show better 
clustering performance than others. 

• GNMF. In addition to K-means, the BCD framework for minimizing / of (|15|) requires to 
be performed. We used the code available from the first author’s website of [7] for performing 
it and the kmeans command. 

We used image data sets and document data in the experiments. We give an explanation for 
the data sets as follows. 

• COIL20. This is a data set of 20 object images. The images are taken by turning 
the objects with 360 degree rotation in each 5 degree interval. The data set consists of 
72 images per object, and contains a total of 1440 images. The size of images is 128- 
by-128 pixels with 256 grayscale intensities. The data set is available from the website 
(http://www.cs.columbia.edu/CAVE/software/softlib/coil-20). Although two types 
of data sets, processed and unprocessed ones, are provided, we used the processed version. 

• JAFFE. This is a data set of facial images of 10 Japanese female models. It consists of 7 
different types of facial expressions per model, and contains a total of 213 images. The size 
of images is 256-by-256 pixels with 256 grayscale intensities. The data set is available from 
the website (http://www.kasrl.org/jaffe_info.html). 

• ORL. This is a data set of facial images of 40 human models. The images are taken by 
changing facial expressions and lightning. The data set consists of 10 different types of 
facial images per model, and contains a total of 400 images. The size of images is 112- 
by-92 pixels with 256 grayscale intensities. The data set is available from the website 

(http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html). 


18 


• MNIST. This is a data set of handwritten digits from 0 to 9. The data set is con¬ 
structed by using some part of data sets available from National Institute of Standards 
and Technology. The data set contains a total of 10000 images. The size of images is 
28-by-28 pixels with 256 grayscale intensities. The data set is available from the website 
(http://yann.lecun.com/exdb/mnist). Although two types of data sets, training data set 
and test data set, are provided, we used the test data set. 

• USPS. Along with MNIST, this is also a data set of handwritten digits from 0 to 9. The 
data set contains a total of 9298 images. The size of images is 16-by-16 pixels, and the 
grayscale intensities are scaled into the interval from —1 to 1. The data set is available from 
the website (http://www.gaussianprocess.org/gpml/data). 

• ReutersTOPlO. This data set was constructed by using some part of Reuters-21578 corpus. 
The corpus contains 21578 news articles appeared on the Reuters newswire in 1987, and the 
articles are manually classified into 135 topic groups. The corpus size can be reduced into 
8293 articles in 65 topic groups by discarding the articles belonging to multiple topics. We 
picked up the top 10 largest topic groups from the size-reduced corpus, and constructed a data 
set by collecting all articles in the topic groups. The data set contained 7285 articles with 
18933 words. We denote it by ReutersTOPlO. The Reuters-21578 corpus is available from 
the UCI Knowledge Discovery in Databases Archive (http://kdd.ics.uci.edu). The size- 
reduced corpus is available from the website (http: //www. cad. zju. edu. cn/home/dengcai). 

• BBC. This is a corpus of news articles appeared on the BBC news website in 2004-2005. The 
news articles are chosen from five topic groups, and are preprocessed by applying stemming, 
stop-word removal, and low word frequency filtering. The data set contains 2225 articles with 
9636 words, and is available from the website (http://mlg.ucd.ie/datasets/bbc.html). 

We generated data matrices by using the above data sets. Consider the case of a image data 
set such that it consists of m grayscale images of s-by-t pixels. In this case, we vectorized each 
image data into an (s x f)-dimensional vector, and constructed an (s x f)-by-m matrix by stacking 
the vectors on the columns. The images in all the data sets except USPS have 256 grayscale 
intensities. Thus, the element values of the data matrices ranged from 0 to 255. For USPS data 
set, we shifted the element values of the data matrix by 1 so as to range from 0 to 2. Consider the 
case of a document data set such that it consists of m document with d words. We constructed 
a d-by-m matrix. The elements of the matrix represent the frequency of words appeared in a 
document, and the appearance frequency of words was evaluated by tf-idf weighting scheme. For 
the details of the scheme, we refer the reader to Section 6.2 of [19 j . 

On each data set, we manually classified the data into groups under predefined criteria such 
that those become the disjoint partitions of the data set. The groups manually constructed are 
referred to as classes in contrast to clusters returned by an algorithm. In case of image data sets, 
the data were classified according to object, human model, or digit. In case of document data 
sets, the data were classified according to topic group. Table Q] summarizes the type of data, size 
of data matrix, and number of classes in the data sets. 

Two measurements were used for the evaluation of clusters constructed by an algorithm. One 
is accuracy (AC) and another is normalized mutual information (NMI). Let fli,..., be classes 
for a data set, and C i,.. . ,C r be clusters returned by an algorithm for the data set. In AC, we 
compute the correspondence relationship between Di,..., O r and C\,... ,C r to maximize the total 
number of common elements |Dj n Cj\. Such a correspondence can be obtained by solving an 
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Table 1: Data type, data matrix size, and the number of classes in the data sets. 



Data type 

Matrix 

d 

size 

m 

# Classes 

COIL20 

Object image 

16384 

1440 

20 

JAFFE 

Facial image 

65536 

213 

10 

ORL 

Facial image 

10304 

400 

40 

MNIST 

Digit image 

784 

10000 

10 

USPS 

Digit image 

256 

9298 

10 

ReutersTOPIO 

News article 

18933 

7285 

10 

BBC 

News article 

9635 

2225 

5 


assignment problem. The indices of the classes and clusters are reattached to follow the obtained 
correspondence order. Then, AC is defined as 

— (|fii n C\ ! + ••• + |D r n C r |). 

m 

In NMI, we compute the mutual information /(D,C) for fl and C, and the entropies 17(12) and 
H(C) for each of 12 and C. Here, 12 and C denote {12i,..., 12 r } and {Ci,... ,C r }. Then, NMI is 
defined as 

i(n,c) 

WmTmcyj' 

We refer the reader to Section 16.3 of m for the precise definitions of mutual information and 
entropy. Both measurements take the values ranging from 0 to 1. If clusters and classes are similar 
to each other, the values are close to one; otherwise, those are close to zero. 

7.2 Illustration to See the Relationship between Clustering Performance and 
Observation [X] 

The clustering performance of NCER can be expected to be high if Observation Q] holds. Experi¬ 
ments were conducted with the purpose of illustrating this. We visualized the products produced 
by NCER on a small data set to see whether Observation Q] holds or not. The data set was con¬ 
structed by picking image data in three classes of MNIST corresponding to handwritten digits 4, 
5, and 6. The data set contained 2832 images of 28-by-28 pixels. We conducted NCER by using 
four different neighbor numbers p, namely, 5, 944,1888, and 2832, chosen so as to divide the range 
from 0 to 2832 into three almost equal parts. The other input parameters were set such that r is 
3, and k is the inner product of two data points. 

Table [2] and Figure [2] display the experimental results. The table summarizes the ACs and 
NMIs for the four neighbor numbers. The figures may need some explanation. The size of the 
data matrix is 784-by-2832. The vectors Pi,...,P 2832 constructed in the step 1 of NCER are 
3-dimensional vectors due to r = 3 and lie on a hyperplane T~L = {x € R 3 : ejx = r}. The MVEE 
for the set S = {±pi,..., ±£> 2832 } is 3-dimensional and is centrally symmetric, having the origin 
as the center. The figures show the intersections of pi,... ,£>2832 and MVEE with T~L. The figures 
display four cases for each neighbor number. The colored points correspond to pi,... , P 2832 , and 
red, blue and green respectively represent the three classes. The ellipsoids surrounded by the black 
lines are the MVEEs, and the squares surrounds active points. 
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Figure 2: Visualization of p±, , P 2832 and MVEEs on a hyperplane These were produced using 
NCER for a data set consisting of three classes in MNIST and choosing four different neighbor 
numbers p. The red, blue, and green points correspond to the three classes for the handwritten 
digits 4, 5, and 6. The ellipsoids surrounded by the black lines are the intersections of MVEEs for 
the set S = {±pi, ... ,±£> 2832 } with 7~L. The points in the squares are active points. The case of 
p = 5 is at the top-left corner, that of p = 944 at the top-right corner, p = 1888 at the bottom-left 
corner, and p = 2832 at the bottom-right corner. 


Table 2: Cluster evaluation obtained by NCER for four different neighbor numbers p. 

p = 5 p = 944 p = 1888 p = 2832 
AC 0.987 0.829 0.546 0.799 

NMI 0.934 0.496 0.258 0.460 


We can say from the figures that Observation |T] holds in the case of p = 5. We see from the table 
that the AC and NMI of this case are high. Meanwhile, it is difficult to say that Observation [T] 
holds in the other cases. In particular, the figure with p = 1888 displays that three types of data 
points are mixed and not separated. In fact, the AC and NMI of this case are low. We see from the 
figures with p = 944 and p = 2832 that the data points group around three active points although 
some of the points are mixed. Hence, the ACs and NMIs of these cases are higher than those of 

p = 1888. 
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7.3 Performance Evaluation 


As mentioned in Section [5j the existing algorithms, NC, NMF, and GNMF, require the initial 
points to be chosen before they are run, and the choice affects the clustering performance. Hence, 
their performance may deteriorate with a choice of initial points which bring a low performance. 
Meanwhile, NCER does not have such issues. Hence, we can expect that its performance will be 
stable and as high as that of NC. 

Experiments were conducted in the purpose of comparing the performance of NCER with the 
existing algorithms. In the experiments of the existing algorithms, we chose multiple initial points 
and measured the worst, best, and average performance. We randomly generated 100 initial points 
for running the K-means algorithm in NC and GNMF and 100 initial points for running the BCD 
framework to minimize the functions of (11411 and (jl5f) in NMF and GNMF. By using these initial 
points, we performed a total of 100 trials for each of NC and NMF and a total of 10000 trials for 
GNMF on each data set. Meanwhile, we carried out one trial for NCER on each data set. The 
parameters of NCER and NC were set such that p is 5, r is the number of classes in the data set, 
and k is the inner product of two data points. We used the COIL20, JAFFE, ORL, MNIST, and 
USPS image data sets. 


Table 3: Performance evaluation of the algorithms in AC. 



NCER 

ave 

NC 

min 

max 

ave 

NMF 

min 

max 

ave 

GNMF 

min 

max 

COIL20 

0.831 

0.585 

0.386 

0.753 

0.574 

0.449 

0.680 

0.681 

0.465 

0.817 

JAFFE 

0.948 

0.753 

0.469 

0.981 

0.203 

0.174 

0.230 

0.789 

0.531 

0.981 

ORL 

0.792 

0.667 

0.570 

0.745 

0.595 

0.515 

0.662 

0.645 

0.555 

0.700 

MNIST 

0.663 

0.664 

0.500 

0.810 

0.522 

0.429 

0.532 

0.632 

0.449 

0.727 

USPS 

0.802 

0.710 

0.395 

0.937 

0.683 

0.563 

0.756 

0.750 

0.289 

0.940 


Table 4: Performance evaluation of the algorithms in NMI. 



NCER 

ave 

NC 

min 

max 

ave 

NMF 

min 

max 

ave 

GNMF 

min 

max 

COIL20 

0.933 

0.813 

0.695 

0.887 

0.712 

0.666 

0.748 

0.857 

0.774 

0.919 

JAFFE 

0.938 

0.874 

0.721 

0.974 

0.109 

0.068 

0.141 

0.910 

0.794 

0.974 

ORL 

0.875 

0.837 

0.787 

0.869 

0.788 

0.759 

0.811 

0.825 

0.797 

0.842 

MNIST 

0.755 

0.735 

0.642 

0.779 

0.460 

0.426 

0.467 

0.719 

0.646 

0.745 

USPS 

0.834 

0.800 

0.622 

0.877 

0.620 

0.566 

0.641 

0.813 

0.574 

0.881 


Tables [3] and [4] summarize the ACs and NMIs of the algorithms on each data set. Since multiple 
trials for NC, NMF, and GNMF were conducted, the statistics of the measurements are shown. 
The columns labeled “ave”, “min”, and “max” list the average, minimum, and maximum values 
of the corresponding measurements. For each data set, we compared the AC and NMI of NCER 
with the average ACs and NMIs of NC, NMF, and GNMF, and the tables show the highest values 
in boldface. We can see that the ACs and NMIs of NCER are higher than the average ACs and 
NMIs of the existing algorithms, except MNIST. NC and GNMF outperform NMF. For JAFFE 
and USPS, the maximum ACs and NMIs of NC and GNMF are higher than the ACs and NMIs of 
NCER, but their minimum ACs and NMIs are considerably lower than their maxima. Hence, the 
averages get worse. For COIL20 and ORL, we can also see that NC and GNMF have gaps between 
the minimum and maximum values of ACs and NMIs. For MNIST, although the AC of NCER is 
lower than the average AC of NC, the difference is quite small. Furthermore, the minimum AC of 
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NC is lower than the AC of NCER. This indicates that there exists an initial point such that NC 
is inferior to NCER in AC. Consequently, we see that the performances of NC, NMF, and GNMF 
depend on the choice of the initial points and the average performance tends to get worse because 
some initial points result in poor performance. Meanwhile, NCER is a stable clustering algorithm 
and has high performance. 

Let us mention the computational times of algorithms. The experiments showed that although 
the computational time of NCER is longer than that of NC, the difference is not so large. The 
biggest difference was in the case of USPS; NCER spent 81.4 seconds, while NC spent 67.0 sec¬ 
onds on average. It should be noted that the experimental results showed that the bottleneck in 
computational time is in solving the eigenvalue problem. 

7.4 Connection of NCER with MER and ER 

Finally, experiments were conducted to see how the performance of NCER varies with the neighbor 
number p. We used the COIL20 and MNIST image data sets and the ReutersTOPIO and BBC 
document data sets. 

We set the input parameters for NCER as follows, p were chosen so as to increase by some 
unit size s such that p € {5, s, 2s ,..., cs , m}. Here, s and c are integers such that s is strictly 
greater than 5 and c satisfies cs < m < (c + l)s. The unit sizes s of each data set were set as 
follows: s = 30 on COIL20, s = 200 on MNIST, s = 200 on ReutersTOPIO, and s = 50 on BBC. 
The other input parameters were set such that r is the number of classes in the data set, and k is 
the inner product of two data points. We set the parameters for MER and ER as follows. Let D 
be the matrix of (|7|) . The data matrix A was scaled to AD -1 / 2 , and the scaled data matrix was 
used as the input for the algorithms. The matrix S in step 2 of the algorithms was set as D^ 1 / 2 . 
The input parameter r was chosen as the number of classes in a data set. 

None of the data sets contained any data that corresponded to a zero vector, and the data 
matrices A satisfied r < rank(A). Hence, Assumptions [21 [3J and 0] held when the neighbor number 
p for NCER was set as m. Accordingly, Theorem Q] and Corollary Q] ensured that, when p = m, 
the clusters returned by NCER would not be far from those returned by ER and would coincide 
with those returned by MER. 

Figure [3] depicts the graphs for showing the ACs and NMIs of NCER, MER, and ER versus 
neighbor number p. The red points connected by the red line plot the ACs and NMIs of NCER. 
The blue and green lines are the ACs and NMIs of MER and ER. Table [5] summarizes the ACs 
and NMIs of NCER with p = m, MER, and ER. We can see from the figure and table that the 
ACs and NMIs of NCER coincide with those of MER when p = m. 

The graphs on the COIL20 and MNIST image data sets indicate that the clustering performance 
of NCER increases as the neighbor number p gets close to zero. However, the graphs on the 
ReutersTOPIO and BBC document data sets indicate that the performance of NCER deteriorates 
when p is a small number close to zero. This difference may have come from the differences in the 
degree of similarity in each class of the data sets. The data in the same class are quite similar to 
each other in the image data sets. For instance, regarding COIL20, the image data in a class were 
taken by turning an object through some interval of degrees. Thus, the image data in the same 
class may retain a similarity structure even as p gets close to zero. However, this may not be the 
case in the document data sets, since the data in the same class are not as similar to each other 
as those of the image data sets. 
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Figure 3: AC and NMI of NCER, MER and ER versus neighbor number p. The graphs from the 
top to bottom show the ACs and NMIs of the algorithms on COIL20, MNIST, ReutersTOPIO, 
and BBC. The left graphs are the ACs, and the right graphs are the NMIs. The horizontal axis 
is the neighbor number p, and the vertical axis is the measured value of AC and NMI. The red 
points connected by the red line are for NCER. The blue and green lines are for MER and ER, 
respectively. 
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Table 5: ACs and NMIs of NCER with p = m, MER, and ER on the data sets. 

NCER with p = m MER ER 



AC 

NMI 

AC 

NMI 

AC 

NMI 

COIL20 

0.376 

0.602 

0.376 

0.602 

0.444 

0.639 

MNIST 

0.362 

0.347 

0.362 

0.347 

0.378 

0.377 

ReutersTOPIO 

0.790 

0.641 

0.790 

0.641 

0.791 

0.647 

BBC 

0.926 

0.803 

0.926 

0.803 

0.926 

0.803 


8 Concluding Remarks 

We developed the NCER algorithm for spectral clustering; it is a variant of the normalized cut 
algorithm of Shi and Malik and Ng et al. The similarity with the ER algorithm for a separable 
NMF was discussed. In particular, if we modify one step of ER, the final outputs of NCER and 
the modified version of ER, called MER, coincide if we place assumptions on the data points and 
input parameters. Experiments indicated that NCER is a stable clustering algorithm and has high 
performance. They also showed how NCER behaves when the neighbor number p was varied. The 
results confirmed our theoretical insight that NCER is connected with MER when p is set to be 
equal to the number of data points. 

Finally, we should mention the issues which will be addressed in future research. In the MVEE 
computation, we used a cutting plane framework to accelerate the efficiency of the interior-point 
algorithm. Thanks to the hybrid of interior-point algorithm and cutting plane algorithm, we 
could handle large problems. However, there is no theoretical guarantee that the hybrid algorithm 
terminates after a finite number of iterations. The experiments showed that it does not achieve 
the stopping criteria under some parameter settings even after many iterations. Therefore, we 
might want to consider alternative approaches for the computation. For instance, the conditional 
gradient algorithm, which is also referred to as the Frank-Wolfe algorithm, for the dual of MVEE 
formulation R(5) is a promising approach; [1] reports encouraging experimental results. The 
memory requirements of the algorithm are not so large, and thus, it should be able to work on 
large problems though it needs more iterations than the interior-point algorithm does. 
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